fix llama3 OOM issue and lm_head unsupport issue #2360

xin3he · 2025-12-09T08:41:48Z

PR Type

Bug fix, Enhancement, Documentation

Description

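Added the `gpu_memory_utilization` parameter to the vLLM `model_args` in `run_benchmark.sh` to prevent OOM
Removed `--quant_lm_head` from the quantization commands in `run_quant.sh`, since vLLM inference does not support a quantized `lm_head`
Updated the README with notes on quantization accuracy and `lm_head` support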

Diagram Walkthrough

flowchart LR
  A["Add gpu_memory_utilization"] -- "Prevent OOM" --> B["Update run_benchmark.sh"]
  C["Remove lm_head quantization"] -- "Support vLLM inference" --> D["Update run_quant.sh"]
  E["Add notes on quantization"] -- "Update README.md" --> F["Document changes"]

File Walkthrough

Relevant files

| Category | File | Change | Lines |
| --- | --- | --- | --- |
| Enhancement | `examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_benchmark.sh` | Add gpu_memory_utilization parameter: added `gpu_memory_utilization=0.8` to `model_args` | +2/-2 |
| Bug fix | `examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_quant.sh` | Remove lm_head quantization: removed `--quant_lm_head` from quantization commands | +3/-6 |
| Documentation | `examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md` | Update README with quantization notes: added notes on quantization accuracy and lm_head support | +8/-0 |
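
If `--quant_lm_head` should stay available for backends that can load a quantized `lm_head`, one option is to make it opt-in rather than deleting it. The following is a minimal sketch, not the PR's actual `run_quant.sh`: the `QUANT_LM_HEAD` variable and the `build_quant_args` helper are hypothetical names introduced here for illustration, while `--dtype`, `--iters`, and `--quant_lm_head` are the flags that appear in the PR.

```shell
#!/bin/sh
# Sketch only: QUANT_LM_HEAD and build_quant_args are hypothetical names,
# not part of the PR's run_quant.sh.
build_quant_args() {
    dtype="$1"
    args="--dtype $dtype --iters 0"
    # Off by default so the exported model keeps an unquantized lm_head,
    # which is what the PR relies on for vLLM inference.
    if [ "${QUANT_LM_HEAD:-0}" = "1" ]; then
        args="$args --quant_lm_head"
    fi
    printf '%s\n' "$args"
}

# Default: lm_head is left unquantized.
build_quant_args MXFP8
# Opt in inside a subshell so the setting does not leak into later calls.
(QUANT_LM_HEAD=1; build_quant_args MXFP8)
```

The subshell keeps the opt-in local to a single invocation, so a later call in the same script falls back to the vLLM-compatible default.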

Signed-off-by: He, Xin3 <xin3.he@intel.com>

PRAgent4INC · 2025-12-09T08:42:35Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue
The removal of `--quant_lm_head` might affect the quantization process, especially if the `quant_lm_head` flag is necessary for certain configurations or models.

    CMD="python quantize.py --model_name_or_path \"$INPUT_MODEL\" $COMMON_ARGS --dtype MXFP8 --iters 0 --export_path \"$OUTPUT_MODEL\""
    echo "Executing command: $CMD"
    python quantize.py \
        --model_name_or_path "$INPUT_MODEL" \
        $COMMON_ARGS \
        --dtype MXFP8 \
        --iters 0 \
        --export_path "$OUTPUT_MODEL"
    ;;

Hardcoded Value
The `gpu_memory_utilization` value is hardcoded to `0.8`. This might not be suitable for all environments and could lead to suboptimal performance or OOM issues in different setups.

    local cmd="lm_eval --model vllm --model_args pretrained=\"$MODEL_PATH\",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=0.8,data_parallel_size=1 --tasks $tasks --batch_size $BATCH_SIZE"
    echo "Executing command: $cmd"
    lm_eval --model vllm \
        --model_args pretrained="$MODEL_PATH",add_bos_token=$add_bos_token,tensor_parallel_size=$TENSOR_PARALLEL_SIZE,gpu_memory_utilization=0.8,data_parallel_size=1 \
        --tasks $tasks \
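
One way to address the hardcoded-value concern is to read the fraction from the environment and fall back to the PR's initial `0.8`. The sketch below is an illustration under assumptions, not the PR's `run_benchmark.sh`: `GPU_MEMORY_UTILIZATION` and `build_model_args` are hypothetical names, and the sample values assigned to `MODEL_PATH`, `add_bos_token`, and `TENSOR_PARALLEL_SIZE` are placeholders.

```shell
#!/bin/sh
# Sketch only: GPU_MEMORY_UTILIZATION and build_model_args are hypothetical;
# MODEL_PATH, add_bos_token and TENSOR_PARALLEL_SIZE mirror the excerpt above.
build_model_args() {
    util="${GPU_MEMORY_UTILIZATION:-0.8}"   # default matches the PR's initial value
    # Accept only a plain decimal in (0, 1]; awk does the float comparison.
    if ! awk -v u="$util" 'BEGIN { exit !(u ~ /^[0-9]*\.?[0-9]+$/ && u > 0 && u <= 1) }'; then
        echo "GPU_MEMORY_UTILIZATION must be in (0, 1], got: $util" >&2
        return 1
    fi
    printf 'pretrained=%s,add_bos_token=%s,tensor_parallel_size=%s,gpu_memory_utilization=%s,data_parallel_size=1\n' \
        "$MODEL_PATH" "$add_bos_token" "$TENSOR_PARALLEL_SIZE" "$util"
}

# Placeholder values for illustration only.
MODEL_PATH="/path/to/model"
add_bos_token=True
TENSOR_PARALLEL_SIZE=1
echo "model_args: $(build_model_args)"
```

The resulting string could then be passed as `--model_args "$(build_model_args)"`, so that a machine with less free memory can lower the fraction without editing the script.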

PRAgent4INC · 2025-12-09T08:42:52Z

PR Code Suggestions ✨

Signed-off-by: He, Xin3 <xin3.he@intel.com>

fix OOM issue and lm_head unsupport issue

86193d7

Signed-off-by: He, Xin3 <xin3.he@intel.com>

xin3he changed the title ~~fix OOM issue and lm_head unsupport issue~~ fix llama3 OOM issue and lm_head unsupport issue Dec 9, 2025

PRAgent4INC added the Review effort 3/5 label Dec 9, 2025

xin3he added 4 commits December 11, 2025 04:02

mem from 0.8 to 0.65

c0522ec

Signed-off-by: He, Xin3 <xin3.he@intel.com>

adapt gpu_memory_utilization for mxfp4/8

ab03662

Signed-off-by: He, Xin3 <xin3.he@intel.com>

fix bug

6979ad1

Signed-off-by: He, Xin3 <xin3.he@intel.com>

reasonable batch size for time estimation

0628fd3

Signed-off-by: He, Xin3 <xin3.he@intel.com>

chensuyue added this to the 3.7 milestone Dec 15, 2025

xin3he added 7 commits December 15, 2025 21:48

increase target bits for llama3.3 70b mxfp4_mixed

ce2b6bf

Signed-off-by: He, Xin3 <xin3.he@intel.com>

fix typo

31b8940

Signed-off-by: He, Xin3 <xin3.he@intel.com>

add tuning for nvfp4

e754142

Signed-off-by: He, Xin3 <xin3.he@intel.com>

apply chat template for benchmark

194f61a

Signed-off-by: He, Xin3 <xin3.he@intel.com>

apply chat only for gsm8k

9e47ba7

Signed-off-by: He, Xin3 <xin3.he@intel.com>
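
The per-task behavior named in the commit above (chat template for gsm8k only) could be expressed roughly as follows. This is a sketch under assumptions rather than the code in the commit: `chat_flags_for_task` is a hypothetical helper, and `--apply_chat_template` is assumed to be the `lm_eval` option used for this purpose.

```shell
#!/bin/sh
# Sketch only: chat_flags_for_task is a hypothetical helper, not taken from
# the PR. It prints the extra lm_eval flag for tasks that need a chat template.
chat_flags_for_task() {
    case "$1" in
        gsm8k) printf '%s\n' "--apply_chat_template" ;;
        *)     : ;;   # other tasks run without the chat template
    esac
}

# Print the command line that would be used for each task.
for task in mmlu gsm8k; do
    echo "lm_eval --tasks $task $(chat_flags_for_task "$task")"
done
```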

change to mmlu_llama, gsm8k_llama

f0f200a

Signed-off-by: He, Xin3 <xin3.he@intel.com>

recover bits=5.8 and rtn for nvfp4

e1c5746

Signed-off-by: He, Xin3 <xin3.he@intel.com>

This was referenced Dec 19, 2025

NVFP4 tuning got device mismatch #2370

Closed

NVFP4 tuning got device mismatch intel/auto-round#1166

Closed

xin3he added 2 commits December 19, 2025 00:51

add autoround tuning

906c259

Signed-off-by: He, Xin3 <xin3.he@intel.com>

remove torch_compile for nvfp4

c198b0b

Signed-off-by: He, Xin3 <xin3.he@intel.com>

xin3he mentioned this pull request Dec 21, 2025

enable_torch_compile causes AttributeError: 'Linear' object has no attribute 'act_max' intel/auto-round#1109

Closed

add 1 more card for nvfp4 quant

100743a

Signed-off-by: He, Xin3 <xin3.he@intel.com>

chensuyue approved these changes Dec 23, 2025

chensuyue merged commit 997f7ed into master Dec 23, 2025
14 checks passed

chensuyue deleted the xinhe/vllm branch December 23, 2025 06:17

4 participants
